User:Helgian/Relative entropy
In probability theory and information theory, the Kullback–Leibler divergence (also information divergence, information gain, or relative entropy) is a non-symmetric measure of the difference between two probability distributions P and Q [S. Kullback (1959) Information theory and statistics (John Wiley and Sons, NY); S. Kullback (1987) The Kullback-Leibler distance, The American Statistician 41:340–341]. The KL divergence measures the expected number of extra bits required to code samples from P when using a code based on Q, rather than a code based on P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory, model, description, or approximation of P.

Although it is often intuited as a distance metric, the KL divergence is not a true metric, since it is not symmetric (hence 'divergence' rather than 'distance') and does not satisfy the triangle inequality. However, it is a premetric and therefore specifies a topology. Moreover, this topology strictly dominates the topology of the total variation due to Pinsker's inequality. The KL divergence is a special case of a broader class of divergences called f-divergences.

Definition

For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined to be

: D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}.

For distributions P and Q of a continuous random variable the sum gives way to an integral, so that

: D_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \; dx,

where p and q denote the densities of P and Q.

More generally, if P and Q are probability measures over a set X, and Q is absolutely continuous with respect to P, then the Kullback–Leibler divergence from P to Q is defined as

: D_{\mathrm{KL}}(P\|Q) = -\int_X \log \frac{d Q}{d P} \; dP,
where \frac{dQ}{dP} is the Radon–Nikodym derivative of Q with respect to P, provided the expression on the right-hand side exists. Likewise, if P is absolutely continuous with respect to Q, then

: D_{\mathrm{KL}}(P\|Q) = \int_X \log \frac{dP}{dQ} \; dP = \int_X \frac{dP}{dQ} \log\frac{dP}{dQ}\; dQ,

which we recognize as the entropy of P relative to Q. Continuing in this case, if \mu is any measure on X for which p = \frac{dP}{d\mu} and q = \frac{dQ}{d\mu} exist, then the Kullback–Leibler divergence from P to Q is given as

: D_{\mathrm{KL}}(P\|Q) = \int_X p \log \frac{p}{q} \;d\mu.

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving the KL divergence hold irrespective of log base.

Motivation, properties and terminology

[Figure: KL divergence for two Gaussian distributions. The typical asymmetry of the KL divergence is clearly visible.]

In information theory, the Kraft–McMillan theorem establishes that any directly-decodable coding scheme for coding a message to identify one value x_i out of a set of possibilities X can be seen as representing an implicit probability distribution q(x_i) = 2^{-l_i} over X, where l_i is the length of the code for x_i in bits. Therefore, the KL divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P.

: \begin{matrix} D_{\mathrm{KL}}(P\|Q) & = & -\sum_x p(x) \log q(x) & + & \sum_x p(x) \log p(x) \\ & = & H(P,Q) & - & H(P) \end{matrix}

where H(P,Q) is called the cross entropy of P and Q, and H(P) is the entropy of P.

The Kullback–Leibler divergence is always non-negative,

: D_{\mathrm{KL}}(P\|Q) \geq 0,

a result known as Gibbs' inequality, with D_{\mathrm{KL}}(P\|Q) zero if and only if P = Q.
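The discrete definition and the cross-entropy decomposition above can be checked numerically. The following is a minimal sketch, not a reference implementation; the function names and the two example distributions are illustrative assumptions, and logarithms are taken to base 2 so that results are in bits.

```python
import math


def kl_divergence(p, q, base=2.0):
    """Return D_KL(P||Q) for two discrete distributions of equal length."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i, base)
    return total


def entropy(p, base=2.0):
    """Shannon entropy H(P)."""
    return -sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0.0)


def cross_entropy(p, q, base=2.0):
    """Cross entropy H(P,Q); assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log(q_i, base)
                for p_i, q_i in zip(p, q) if p_i > 0.0)


# Hypothetical example: a "true" distribution P coded with a uniform model Q.
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

extra_bits = kl_divergence(p, q)
decomposition = cross_entropy(p, q) - entropy(p)
print(extra_bits, decomposition)  # both are about 0.085 bits per symbol
```

For these inputs the uniform code costs log2(3) bits per symbol while an optimal code for P costs 1.5 bits, so the two printed values agree at roughly 0.085 bits, and the divergence of P from itself is exactly zero.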
The entropy H(P) thus sets a minimum value for the cross-entropy H(P,Q), the expected number of bits required when using a code based on Q rather than P; and the KL divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q, rather than the "true" distribution P.

Originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions, it is not the same as a divergence in calculus. One might be tempted to call it a "distance metric" on the space of probability distributions, but this would not be correct, as the Kullback–Leibler divergence is not symmetric, that is, in general

: D_{\mathrm{KL}}(P\|Q) \neq D_{\mathrm{KL}}(Q\|P),

nor does it satisfy the triangle inequality. Still, being a premetric, it generates a topology on the space of generalized probability distributions, of which probability distributions proper are a special case. More concretely, if \{P_1, P_2, \ldots\} is a sequence of distributions such that

: \lim_{n \rightarrow \infty} D_{\mathrm{KL}}(P_n\|P) = 0

then it is said that P_n \xrightarrow{\mathrm{D}} P. Pinsker's inequality entails that P_n \xrightarrow{\mathrm{D}} P \Rightarrow P_n \xrightarrow{\mathrm{TV}} P, where the latter stands for the usual convergence in total variation.

Following Rényi (1970, 1961) the term is sometimes also called the information gain about X achieved if P can be used instead of Q. It is also called the relative entropy, for using Q instead of P.

The Kullback–Leibler divergence remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.
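The asymmetry and the link to total variation mentioned above can be illustrated with a small numerical check. This is a sketch under stated assumptions: the two distributions are arbitrary examples, logarithms are natural (nats), total variation is taken as half the L1 distance, and Pinsker's inequality is used in the form TV(P,Q) ≤ sqrt(D_KL(P||Q) / 2).

```python
import math


def kl_nats(p, q):
    """D_KL(P||Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i)
               for p_i, q_i in zip(p, q) if p_i > 0.0)


def total_variation(p, q):
    """Total variation distance, taken here as half the L1 distance."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))


# Hypothetical example distributions on two outcomes.
p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_nats(p, q)   # D_KL(P||Q)
reverse = kl_nats(q, p)   # D_KL(Q||P), generally a different number
tv = total_variation(p, q)
pinsker_bound = math.sqrt(0.5 * forward)

print(forward, reverse)   # the two directions differ
print(tv, pinsker_bound)  # tv does not exceed the Pinsker bound
```

Because the bound forces the total variation to zero whenever the divergence goes to zero, convergence in KL divergence implies convergence in total variation, which is the direction of implication stated above.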
Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of the KL divergence to specific cases.

The self-information,

: I(m) = D_{\mathrm{KL}}(\delta_{im} \| \{ p_i \}),

is the KL divergence of the probability distribution P(i) from a Kronecker delta representing certainty that i = m; i.e. the number of extra bits that must be transmitted to identify i if only the probability distribution P(i) is available to the receiver, not the fact that i = m.

The mutual information,

: \begin{array}{rl} I(X;Y) & = D_{\mathrm{KL}}(P(X,Y) \| P(X)P(Y) ) \\ & = \mathbb{E}_X \{D_{\mathrm{KL}}(P(Y|X) \| P(Y) ) \} \\ & = \mathbb{E}_Y \{D_{\mathrm{KL}}(P(X|Y) \| P(X) ) \} \end{array}

is the KL divergence of the product P(X)P(Y) of the two marginal probability distributions from the joint probability distribution P(X,Y); i.e. the expected number of extra bits that must be transmitted to identify X and Y if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability P(X,Y) is known, it is the expected number of extra bits that must on average be sent to identify Y if the value of X is not already known to the receiver.

The Shannon entropy,

: \begin{array}{rl}H(X) & = \mathrm{(i)} \, \mathbb{E}_x \{I(x)\} \\ & = \mathrm{(ii)} \log N - D_{\mathrm{KL}}(P(X) \| P_U(X) )\end{array}

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the KL divergence of the uniform distribution P_U(X) from the true distribution P(X); i.e. less the expected number of bits saved, which would have had to be sent if the value of X were coded according to the uniform distribution P_U(X) rather than the true distribution P(X).
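The identities for mutual information and Shannon entropy above can be verified on a small joint distribution. The sketch below is illustrative only: the 2×2 joint table and all variable names are assumptions, and logarithms are to base 2.

```python
import math


def kl_bits(p, q):
    """D_KL(P||Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i)
               for p_i, q_i in zip(p, q) if p_i > 0.0)


# Hypothetical joint distribution P(X,Y): rows index X, columns index Y.
joint = [[0.3, 0.1],
         [0.2, 0.4]]

p_x = [sum(row) for row in joint]        # marginal P(X)
p_y = [sum(col) for col in zip(*joint)]  # marginal P(Y)

# I(X;Y) as the divergence of the joint from the product of marginals.
joint_flat = [v for row in joint for v in row]
product_flat = [a * b for a in p_x for b in p_y]
mi_joint = kl_bits(joint_flat, product_flat)

# I(X;Y) as the expectation over X of D_KL(P(Y|X) || P(Y)).
mi_conditional = sum(
    p_x[i] * kl_bits([v / p_x[i] for v in joint[i]], p_y)
    for i in range(len(joint))
)

# H(X) directly, and as log N minus the divergence from the uniform law.
n = len(p_x)
h_direct = -sum(a * math.log2(a) for a in p_x if a > 0.0)
h_via_kl = math.log2(n) - kl_bits(p_x, [1.0 / n] * n)

print(mi_joint, mi_conditional)
print(h_direct, h_via_kl)
```

Each printed pair should agree up to floating-point rounding, since both members of a pair are algebraically the same quantity.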
The conditional entropy,

: \begin{array}{rl}H(X|Y) & = \log N - D_{\mathrm{KL}}(P(X,Y) \| P_U(X) P(Y) ) \\ & = \mathrm{(i)} \,\, \log N - D_{\mathrm{KL}}(P(X,Y) \| P(X) P(Y) ) - D_{\mathrm{KL}}(P(X) \| P_U(X)) \\ & = H(X) - I(X;Y) \\ & = \mathrm{(ii)} \, \log N - \mathbb{E}_Y \{ D_{\mathrm{KL}}(P(X|Y) \| P_U(X)) \}\end{array}

is the number of bits which would have to be transmitted to identify X from N equally likely possibilities, less the KL divergence of the product distribution P_U(X) P(Y) from the true joint distribution P(X,Y); i.e. less the expected number of bits saved which would have had to be sent if the value of X were coded according to the uniform distribution P_U(X) rather than the conditional distribution P(X|Y) of X given Y.

The cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows:

: \mathrm{H}(p, q) = \mathrm{E}_p[-\log q] = \mathrm{H}(p) + D_{\mathrm{KL}}(p \| q).

KL divergence and Bayesian updating

In Bayesian statistics the KL divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution. If some new fact Y = y is discovered, it can be used to update the probability distribution for X from p(x|I) to a new posterior probability distribution p(x|y,I) using Bayes' theorem:

: p(x|y,I) = \frac{p(y|x) p(x|I)}{p(y|I)}

This distribution has a new entropy

: H\big( p(\cdot|y,I) \big) = -\sum_x p(x|y,I) \log p(x|y,I),

which may be less than or greater than the original entropy H(p(·|I)).
However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on p(x|I) instead of a new code based on p(x|y,I) would have added an expected number of bits

: D_{\mathrm{KL}}\big(p(\cdot|y,I) \big\|p(\cdot|I) \big) = \sum_x p(x|y,I) \log \frac{p(x|y,I)}{p(x|I)}

to the message length. This therefore represents the amount of useful information, or information gain, about X, that we can estimate has been learned by discovering Y = y.

If a further piece of data, Y_2 = y_2, subsequently comes in, the probability distribution for x can be updated further, to give a new best guess p(x|y_1,y_2,I). If one reinvestigates the information gain for using p(x|y_1,I) rather than p(x|I), it turns out that it may be either greater or less than previously estimated:

: \sum_x p(x|y_1,y_2,I) \log \frac{p(x|y_1,I)}{p(x|I)}

may be less than, equal to, or greater than

: \sum_x p(x|y_1,I) \log \frac{p(x|y_1,I)}{p(x|I)}

and so the combined information gain does not obey the triangle inequality:

: D_{\mathrm{KL}} \big( p(\cdot|y_1,y_2,I) \big\| p(\cdot|I) \big)

may be less than, equal to, or greater than

: D_{\mathrm{KL}} \big( p(\cdot|y_1,y_2,I)\big\| p(\cdot|y_1,I) \big) + D_{\mathrm{KL}} \big( p(\cdot |y_1,I) \big\| p(\cdot|I) \big)

All one can say is that on average, averaging using p(y_2|y_1,x,I), the two sides will average out.

Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected KL divergence between the prior and the posterior. When posteriors are approximated to be Gaussian distributions, a design maximising the expected KL divergence is called Bayes d-optimal.
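The information gain of a Bayesian update, and the failure of successive gains to add up, can be seen in a toy model. This is a minimal sketch with assumed ingredients: two hypotheses about a coin (fair, or heads with probability 0.9), a uniform prior, and two observed heads; none of these numbers come from the text above.

```python
import math


def kl_bits(p, q):
    """D_KL(P||Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log2(p_i / q_i)
               for p_i, q_i in zip(p, q) if p_i > 0.0)


def bayes_update(prior, likelihood):
    """Posterior over hypotheses, given the likelihood of the observed datum."""
    unnormalised = [l * pr for l, pr in zip(likelihood, prior)]
    evidence = sum(unnormalised)  # plays the role of p(y|I)
    return [u / evidence for u in unnormalised]


prior = [0.5, 0.5]            # p(x|I) over {fair coin, biased coin}
likelihood_heads = [0.5, 0.9]  # p(heads|x) for each hypothesis

posterior_1 = bayes_update(prior, likelihood_heads)        # after one head
posterior_2 = bayes_update(posterior_1, likelihood_heads)  # after two heads

gain_total = kl_bits(posterior_2, prior)         # gain relative to the prior
gain_step_1 = kl_bits(posterior_1, prior)        # gain from the first datum
gain_step_2 = kl_bits(posterior_2, posterior_1)  # gain from the second datum

print(gain_total, gain_step_1 + gain_step_2)  # these need not coincide
```

In this particular example the total gain is larger than the sum of the two step-wise gains; other data sequences can push the comparison the other way, which is the sense in which the combined information gain does not obey the triangle inequality.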
Discrimination information

The Kullback–Leibler divergence D_KL( p(x|H_1) || p(x|H_0) ) can also be interpreted as the expected discrimination information for H_1 over H_0: the mean information per sample for discriminating in favour of a hypothesis H_1 against a hypothesis H_0, when hypothesis H_1 is true. Another name for this quantity, given to it by I.J. Good, is the expected weight of evidence for H_1 over H_0 to be expected from each sample.

The expected weight of evidence for H_1 over H_0 is not the same as the information gain expected per sample about the probability distribution p(H) of the hypotheses,

: D_{\mathrm{KL}}( p(x|H_1) \| p(x|H_0) ) \neq IG = D_{\mathrm{KL}}( p(H|x) \| p(H|I) ).

Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate, but they will in general lead to rather different experimental strategies.

On the entropy scale of information gain there is very little difference between near certainty and absolute certainty: coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous, infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.
Principle of minimum discrimination information

The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution f should be chosen which is as hard to discriminate from the original distribution f_0 as possible, so that the new data produces as small an information gain D_KL( f || f_0 ) as possible.

For example, if one had a prior distribution p(x,a) over x and a, and subsequently learnt the true distribution of a was u(a), the Kullback–Leibler divergence between the new joint distribution for x and a, q(x|a) u(a), and the earlier prior distribution would be:

: D_\mathrm{KL}(q(x|a)u(a)\|p(x,a)) = \mathbb{E}_{u(a)}\{D_\mathrm{KL}(q(x|a)\|p(x|a))\} + D_\mathrm{KL}(u(a)\|p(a)),

i.e. the sum of the KL divergence of p(a), the prior distribution for a, from the updated distribution u(a), plus the expected value (using the probability distribution u(a)) of the KL divergence of the prior conditional distribution p(x|a) from the new conditional distribution q(x|a). This is minimised if q(x|a) = p(x|a) over the whole support of u(a); and we note that this result incorporates Bayes' theorem, if the new distribution u(a) is in fact a δ function representing certainty that a has one particular value.

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the KL divergence continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. This is not entirely helpful.
Minimising the KL divergence of m from p with respect to m is equivalent to minimising the cross-entropy of p and m, since

: H(p,m) = H(p) + D_{\mathrm{KL}}(p\|m),

which is appropriate if one is trying to choose a least 'brain-damaged' approximation to p. However, this is just as often not the task one is trying to achieve. Instead, just as often it is m that is some fixed prior reference measure, and p that one is attempting to optimise by minimising D_KL(p||m) subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D_KL(p||m), rather than H(p,m).

Relationship to available work

Surprisals [Myron Tribus (1961) Thermodynamics and thermostatics (D. Van Nostrand, New York)] add where probabilities multiply. The surprisal for an event of probability p is defined as s ≡ k ln(1/p). If k is {1, 1/ln 2, 1.38×10^-23} then surprisal is in {nats, bits, or J/K} so that, for instance, there are N bits of surprisal for landing all "heads" on a toss of N coins.

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal S (entropy) for a given set of control parameters (like pressure P or volume V). This constrained entropy maximization, both classically [E. T. Jaynes (1957) Information theory and statistical mechanics, Physical Review 106:620] and quantum mechanically [E. T. Jaynes (1957) Information theory and statistical mechanics II, Physical Review 108:171], minimizes Gibbs availability in entropy units [J.W. Gibbs (1873) A method of geometrical representation of thermodynamic properties of substances by means of surfaces, reprinted in The Collected Works of J. W. Gibbs, Volume I Thermodynamics, ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green, 1931) footnote page 52], A ≡ -k ln Z, where Z is a constrained multiplicity or partition function.
When temperature T is fixed, free energy (T times A) is also minimized. Thus if T, V and number of molecules N are constant, the Helmholtz free energy F ≡ U - TS (where U is energy) is minimized as a system "equilibrates." If T and P are held constant (say during processes in your body), the Gibbs free energy G ≡ U + PV - TS is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature T_o and pressure P_o is W = ΔG = N k T_o Θ(V/V_o), where V_o = N k T_o/P_o and Θ(x) ≡ x - 1 - ln x ≥ 0 (see also Gibbs inequality).

More generally [M. Tribus and E. C. McIrvine (1971) Energy and information, Scientific American 224:179-186], the work available relative to some ambient is obtained by multiplying ambient temperature T_o by KL-divergence or net surprisal ΔI ≥ 0, defined as the average value of k ln(p/p_o), where p_o is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of V_o and T_o is thus W = T_o ΔI, where KL-divergence ΔI = N k (Θ(V/V_o) + (3/2) Θ(T/T_o)). The resulting contours of constant KL-divergence (originally plotted in an accompanying figure for a mole of argon at standard temperature and pressure), for example, put limits on the conversion of hot to cold, as in flame-powered air-conditioning or in the unpowered device to convert boiling water to ice water discussed in [P. Fraundorf (2007) Thermal roots of correlation-based complexity, Complexity 13:3, 18-26]. Thus KL-divergence measures thermodynamic availability in bits.

Quantum information theory

For density matrices P and Q on a Hilbert space the K–L divergence (or relative entropy as it is often called in this case) from P to Q is defined to be

: D_{\mathrm{KL}}(P\|Q) = \operatorname{Tr}(P( \log(P) - \log(Q))).

In quantum information science it can also be used as a measure of entanglement in a state.
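Returning to the available-work formula for an ideal gas at constant ambient temperature, W = N k T_o Θ(V/V_o) with Θ(x) = x - 1 - ln x, a short numerical sketch may help. The function names, the one-mole example, and the chosen temperature and volume ratios are assumptions made for illustration; only the isothermal case is covered.

```python
import math

BOLTZMANN_K = 1.380649e-23  # Boltzmann constant, J/K
AVOGADRO = 6.02214076e23    # molecules per mole


def theta(x):
    """Theta(x) = x - 1 - ln(x), non-negative for x > 0 and zero only at x = 1."""
    if x <= 0.0:
        raise ValueError("theta is defined for positive arguments only")
    return x - 1.0 - math.log(x)


def available_work_isothermal(n_molecules, t_ambient, volume_ratio):
    """W = N k T_o Theta(V / V_o), in joules, for gas held at ambient temperature."""
    return n_molecules * BOLTZMANN_K * t_ambient * theta(volume_ratio)


# Hypothetical example: one mole of gas at 298.15 K, compressed to half or
# expanded to twice its ambient-equilibrium volume V_o.
for ratio in (0.5, 1.0, 2.0):
    work = available_work_isothermal(AVOGADRO, 298.15, ratio)
    print(ratio, work)
```

The work vanishes when V = V_o and is positive on either side, which mirrors the non-negativity of the KL divergence expressed by Gibbs' inequality.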
Relationship between models and reality

Just as KL-divergence of "ambient from actual" measures thermodynamic availability, KL-divergence of "model from reality" is also useful even if the only clues we have about reality are some experimental measurements. In the former case KL-divergence describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.

Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to models in ecology via the Akaike Information Criterion is particularly well described in papers [Kenneth P. Burnham and David R. Anderson (2001) Kullback-Leibler information as a basis for strong inference in ecological studies, Wildlife Research 28:111-119] and a book [Burnham, K. P. and Anderson D. R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition (Springer Science, New York) ISBN 978-0-387-95364-9] by Burnham and Anderson. In a nutshell, the KL-divergence of a model from reality may be estimated, to within a constant additive term, by a function (like the sum of squares) of the deviations observed between data and the model's predictions. Estimates of such divergence for models that share the same additive term can in turn be used to choose between models.

When trying to fit parametrized models to data there are various estimators which attempt to minimize Kullback–Leibler divergence, such as maximum likelihood and maximum spacing estimators.

Symmetrised divergence

Kullback and Leibler themselves actually defined the divergence as:

: D_{\mathrm{KL}}(P\|Q) + D_{\mathrm{KL}}(Q\|P)

which is symmetric and nonnegative.
This quantity has sometimes been used for feature selection in classification problems, where P and Q are the conditional pdfs of a feature under two different classes.

An alternative is given via the λ divergence,

: D_{\lambda}(P\|Q) = \lambda D_{\mathrm{KL}}(P\|\lambda P + (1-\lambda)Q) + (1-\lambda) D_{\mathrm{KL}}(Q\|\lambda P + (1-\lambda)Q),

which can be interpreted as the expected information gain about X from discovering which probability distribution X is drawn from, P or Q, if they currently have probabilities λ and (1 − λ) respectively.

The value λ = 0.5 gives the Jensen–Shannon divergence, defined by

: D_{\mathrm{JS}} = \tfrac{1}{2} D_{\mathrm{KL}} \left (P \| M \right ) + \tfrac{1}{2} D_{\mathrm{KL}}\left (Q \| M \right )

where M is the average of the two distributions,

: M = \tfrac{1}{2}(P+Q).

D_JS can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions p and q. The Jensen–Shannon divergence is the square of a metric that is equivalent to the Hellinger metric; it is related to, but distinct from, the symmetrised sum above, which is also known as the Jeffreys divergence (Rubner et al., 2000; Jeffreys 1946).

Relationship to Hellinger distance

If P and Q are two probability measures, then the squared Hellinger distance is the quantity given by

: H^2(P,Q) = \frac{1}{2}\int \left|\sqrt{dP} - \sqrt{dQ}\right|^2.

Noting that x − 1 ≥ log(x), so that in particular \sqrt{x}-1 \geq \frac{1}{2}\log(x), we see that

: \sqrt{\frac{dP}{dQ}}-1 \geq \frac{1}{2}\log\left(\frac{dP}{dQ}\right).

Taking expectations with respect to Q, and using E_Q \sqrt{dP/dQ} = 1 - H^2(P,Q), we get

: H^2(P,Q) \leq \frac{1}{2} E_Q \log \frac{dQ}{dP} = \frac{1}{2} D_{\mathrm{KL}}(Q\|P).

Other probability-distance measures

Other measures of probability distance are the histogram intersection, Chi-square statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance (Rubner et al. 2000).
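The Jensen–Shannon divergence and the Hellinger bound discussed above can both be evaluated for discrete distributions. The following is a sketch under stated assumptions: the example distributions and function names are made up for illustration, natural logarithms are used, and the squared Hellinger distance carries the factor 1/2 as in the definition above.

```python
import math


def kl_nats(p, q):
    """D_KL(P||Q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i)
               for p_i, q_i in zip(p, q) if p_i > 0.0)


def jensen_shannon(p, q):
    """D_JS: average divergence of P and Q from their midpoint M."""
    m = [0.5 * (p_i + q_i) for p_i, q_i in zip(p, q)]
    return 0.5 * kl_nats(p, m) + 0.5 * kl_nats(q, m)


def hellinger_squared(p, q):
    """H^2(P,Q) = (1/2) * sum (sqrt(p_i) - sqrt(q_i))^2."""
    return 0.5 * sum((math.sqrt(p_i) - math.sqrt(q_i)) ** 2
                     for p_i, q_i in zip(p, q))


# Hypothetical example distributions on three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

js_pq = jensen_shannon(p, q)
js_qp = jensen_shannon(q, p)  # same value: D_JS is symmetric
h2 = hellinger_squared(p, q)
kl_qp = kl_nats(q, p)

print(js_pq, js_qp)
# H^2 is bounded by D_KL(Q||P); with this normalisation, by half of it.
print(h2, 0.5 * kl_qp, kl_qp)
```

Unlike the KL divergence itself, D_JS stays finite even when the supports differ, because each distribution is compared with the mixture M, which is positive wherever either P or Q is; in nats it never exceeds ln 2.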
See also

* Jensen–Shannon divergence
* Bregman divergence
* Akaike information criterion
* Deviance information criterion
* Bayesian information criterion
* Quantum relative entropy
* Information gain in decision trees
* Solomon Kullback and Richard Leibler
* Information theory and measure theory

References

* Fuglede, B., and Topsøe, F., 2004. Jensen-Shannon Divergence and Hilbert Space Embedding. IEEE Int Sym Information Theory.
* Rubner, Y., Tomasi, C., and Guibas, L. J., 2000. The Earth Mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2): 99-121.